Querying the Deutsches Textarchiv

Authors

  • Bryan Jurish
  • Christian Thomas
  • Frank Wiegand
Abstract

Historical document collections present unique challenges for information retrieval. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for conventional search architectures which typically rely on a static inverted index keyed by orthographic form. Additional steps must therefore be taken in order to improve recall, in particular for single-term bareword queries from nonexpert users. This paper describes the query processing architecture currently employed for full-text search of the historical German document collection of the Deutsches Textarchiv project.
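The recall problem described in the abstract can be illustrated with a toy example. The sketch below is not the Deutsches Textarchiv query architecture; it only contrasts an inverted index keyed by surface (orthographic) form with one keyed by a canonical form. The `CANON` table, the `canonicalize` helper, and the sample documents are hypothetical and exist purely for illustration.

```python
from collections import defaultdict

# Hypothetical hand-written table mapping historical spellings to modern
# forms. A real system would need a far more general canonicalization step.
CANON = {"theil": "teil", "seyn": "sein", "thun": "tun"}


def canonicalize(token):
    """Lowercase a token and map it to its canonical form if one is known."""
    lowered = token.lower()
    return CANON.get(lowered, lowered)


def build_index(docs, key=str.lower):
    """Build an inverted index: key(token) -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[key(token)].add(doc_id)
    return index


def search(index, term, key=str.lower):
    """Look up a single bareword term using the same key function as the index."""
    return sorted(index.get(key(term), set()))


# Illustrative documents only (not taken from the DTA corpus).
DOCS = {
    1: "der erste Theil des Buches",
    2: "ein Teil der Welt",
    3: "es muß seyn",
}

surface_index = build_index(DOCS)                      # keyed by orthographic form
canonical_index = build_index(DOCS, key=canonicalize)  # keyed by canonical form

surface_hits = search(surface_index, "Teil")
canonical_hits = search(canonical_index, "Teil", key=canonicalize)

print("surface:", surface_hits)
print("canonical:", canonical_hits)
```

With the surface-form index, a query for the modern spelling "Teil" misses the document that contains the historical spelling "Theil". Applying the same canonicalization at both index time and query time retrieves both documents.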

Similar resources

Finite-state canonicalization techniques for historical German

Acknowledgements There are a great many people who supported and influenced this work and to whom thanks are due: First, to my advisor Peter Staudacher, who despite (or perhaps because of) his professed ambivalence to impressing others has succeeded in impressing many of his students – myself included – with a taste for the formally rigorous study of natural language; and whose patience and rep...

Corpus Analysis based on Structural Phenomena in Texts: Exploiting TEI Encoding for Linguistic Research

This paper poses the question, how linguistic corpus-based research may be enriched by the exploitation of conceptual text structures and layout as provided via TEI annotation. Examples for possible areas of research and usage scenarios are provided based on the German historical corpus of the Deutsches Textarchiv (DTA) project, which has been consistently tagged accordant to the TEI Guidelines...

Developing a BIM-based Spatial Ontology for Semantic Querying of 3D Property Information

With the growing dominance of complex and multi-level urban structures, current cadastral systems, which are often developed based on 2D representations, are not capable of providing unambiguous spatial information about urban properties. Therefore, the concept of 3D cadastre is proposed to support 3D digital representation of land and properties and facilitate the communication of legal owners...

Canonicalizing the Deutsches Textarchiv

Virtually all conventional text-based natural language processing techniques – from traditional information retrieval systems to full-fledged parsers – require reference to a fixed lexicon accessed by surface form, typically trained from or constructed for synchronic input text adhering strictly to contemporary orthographic conventions. Unconventional input such as historical text which violate...

Publication date: 2014